[SYCL] Optional kernel features: implement split based on reqd-work-group-size#8056
Conversation
…-work-group-size This patch implements device code split based on reqd-work-group-size attribute, enables generation of "reqd_work_group_size" property in "SYCL/device requirements" property set, and adds support of reqd-work-group-size to sycl::is_compatible Design: https://github.com/intel/llvm/blob/sycl/sycl/doc/design/OptionalDeviceFeatures.md#changes-to-the-device-code-split-algorithm E2E tests: TBA
|
/verify with intel/llvm-test-suite#1528 |
| ; RUN: FileCheck %s -input-file=%t_0.sym --check-prefixes CHECK-SYCL-SYM | ||
| ; RUN: FileCheck %s -input-file=%t_esimd_0.sym --check-prefixes CHECK-ESIMD-SYM | ||
| ; RUN: FileCheck %s -input-file=%t_esimd_large_grf_1.sym --check-prefixes CHECK-ESIMD-LargeGRF-SYM | ||
| ; RUN: FileCheck %s -input-file=%t_esimd_large_grf_0.ll --check-prefixes CHECK-ESIMD-LargeGRF-IR |
There was a problem hiding this comment.
It is ok to change this test and test sycl-large-grf.ll below: by adding new optional kernel feature reqd-work-group-size to internal data structures, re-hashing happened which affected name generation logic. This does not affect customers as customers don't use these temp files directly, and didn't break anything in the pipeline.
|
For some reason these tests fail in CI but pass locally. Investigating this. Upd: fixed |
llvm/test/tools/sycl-post-link/device-code-split/per-reqd-wg-size-split-2.ll
Outdated
Show resolved
Hide resolved
llvm/test/tools/sycl-post-link/device-requirements/reqd-work-group-size.ll
Outdated
Show resolved
Hide resolved
AlexeySachkov
left a comment
There was a problem hiding this comment.
sycl-post-link part LGTM
|
/verify with intel/llvm-test-suite#1528 |
steffenlarsen
left a comment
There was a problem hiding this comment.
Runtime changes LGTM!
Known issue: #8098 |
|
SYCL :: USM/copy2d.cpp in Jenkins/llvm-test-suite (Windows, Level Zero) is unrelated to this patch and looks flaky, because in the same time the same test passed in intel/llvm-test-suite#1528 in Jenkins/llvm-test-suite |
) #### Intro This is a refactoring of how we perform device code split in `sycl-post-link`, which is intended to solve several existing issues with the current implementation: 1. increased peak RAM consumption by `sycl-post-link` 2. bad scaling with more and more split "dimensions" being added 3. increased tests maintenance cost due to non-deterministic order (between commits) of output files produced by `sycl-post-link` #### A bit more context about the issues above: (1) Increase peak RAM consumption is caused by the fact that we currently preserve **all** splits in-memory, even though we can process them on-by-one and discard them as soon as we stored them to a disk. This was implemented as a memory consumption optimization in #5021, but it got accidentally reverted in #7302 as an attempt to workaround (2). (2) is pretty much summarized in our source code: https://github.com/intel/llvm/blob/afebb2543ccecb89f83c84b68fba7616bbab89ac/llvm/tools/sycl-post-link/sycl-post-link.cpp#L806-L811 (3) is caused by a bad implementation decision made in #7302: because every split is now identified by a hash, every time you add a new split "dimension"/new feature to an account, it results in different hashes for existing tests. Just look how many unrelated tests had to be updated in #7512, #8056 and #8167 #### Now to the PR itself: It introduces a new infrastructure for categorizing/grouping kernel functions: instead of using hashes, we now build a string description for each kernel function and then group kernels with the same description string together. String description is built by a new entity: it accepts a set of rules, where each rule is a simple function which returns a string for passed `llvm::Function`. Results of all rules are concatenated together and rules are invoked in a stable order of their registration. There is a simple API for building those rules. It provides some predefined rules for the most popular use cases like turning a function attribute or a metadata into a string descriptor for the function. There is also a possibility to pass a custom callback there to implement more complicated logic. #### How does this PR help with issues above? (1) and (2) are fixed in conjunction: `sycl-post-link` was refactored to avoid storing more than one split module at a time and that is possible because the PR unifies per-scope and optional-kernel-features splitters into a single generic splitter. The new API for kernels categorization seems to be flexible enough to provide that infrastructure so merged splitters still look OK code-wise. (3) is caused by using string identifiers instead of hashes as well as by using a data structure which sorts identifiers. #### Any other benefits from this PR? About 50 lines of code less to support :) Extending device code split for more optional features would be even easier than it is now: instead of adding several changes to various places around `UsedOptionalFeatures` structure, it will be just adding a 1-3 lines of code. Please also note that `UsedOptionalFeatures` contains tons of inconsistencies in its implementation, which will all gone with this PR: in `operator==` we don't use hash and instead compare certain fields directly (and we do miss some of them); `generateModuleName` method skips some of optional features and ignores them. Cross-module `device_global` usages checks should now work at all split dimensions (except for ESIMD). #### Any potential downsides? With current `UsedOptionalFeatures` there is a possibility to embed various information (used aspects, `large-grf` flag, etc.) directly during device code split to avoid re-gathering that information later when we generate properties. With the suggested approach, it would be harder to do, because it doesn't seem to naturally fit to the proposed infrastructure: see changes I did around `large-grf` in this PR. However, we have never actually implemented this and re-querying some metadata from function doesn't seem like a bottleneck, so it should really be a very minor and only theoretical downside.
This patch implements device code split based on reqd-work-group-size attribute, enables generation of "reqd_work_group_size" property in "SYCL/device requirements" property set, and adds support of reqd-work-group-size to sycl::is_compatible
Design:
https://github.com/intel/llvm/blob/sycl/sycl/doc/design/OptionalDeviceFeatures.md#changes-to-the-device-code-split-algorithm
E2E tests: intel/llvm-test-suite#1528
This patch make these SYCL CTS tests pass:
[[sycl::reqd_work_group_size(4294967295)]]CHECK( sycl::is_compatible(builtinKernelIds, device) == (size_t(device.get_info<sycl::info::device::max_work_item_sizes<1>>()) > 4294967295) )